Zoom API Design Evaluation and Latency Budget
Analyze the non-functional requirements and estimate the response time of the Zoom meeting API.
Introduction#
Modeling a complex service is a time-consuming process that may require many rounds of fine-tuning. In this lesson, we’ll discuss how we can achieve the non-functional requirements, especially real-time communication, and estimate the response time of our proposed Zoom meeting API.
Non-functional requirements#
Let's discuss the non-functional requirements for our Zoom API one by one:
Availability and reliability#
We ensure the availability of our services by dividing servers according to different roles. For example, the meeting service handles requests to create and update meetings, add participants, and so on, while the media controller handles client requests for managing meeting sessions. By adopting a role-based style, we can separate different workflows. In the event of a failure, if one service goes down, the other can still run normally, making our system resilient to complete outages. Additionally, services and data are replicated across different geographic regions to avoid single points of failure (SPOF). We also have API monitoring and circuit breakers to detect and handle failures as quickly as possible. We limit concurrent meeting requests based on the account type for efficient resource management. For free users, we also limit the maximum duration of a meeting to avoid overloading the servers.
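The per-account limits described above can be sketched as a small lookup. The account-type names and quota values here are illustrative assumptions, not Zoom's real limits:

```python
# Illustrative per-account-type quotas; the tiers and numbers are assumptions.
ACCOUNT_LIMITS = {
    "free":     {"max_concurrent_meetings": 1, "max_duration_min": 40},
    "pro":      {"max_concurrent_meetings": 2, "max_duration_min": None},
    "business": {"max_concurrent_meetings": 5, "max_duration_min": None},
}

def can_start_meeting(account_type: str, active_meetings: int) -> bool:
    """Reject a new meeting once the account's concurrency quota is reached."""
    limits = ACCOUNT_LIMITS[account_type]
    return active_meetings < limits["max_concurrent_meetings"]

print(can_start_meeting("free", 0))  # True: no meeting running yet
print(can_start_meeting("free", 1))  # False: a second concurrent meeting is rejected
```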
Security#
We use TLS 1.3 for normal communication and to exchange AES keys for multimedia transmission. After successfully sharing the key, the connection is upgraded to WebSockets for AES-encrypted data transfers. We implement authentication and authorization using a login mechanism along with OAuth, and OpenID Connect with PKCE flows for third-party interactions. Connecting to the media router requires an access token. Guest (unregistered) participants can also join using an access token, which is only issued when the host accepts their join request.
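A minimal sketch of the guest-admission flow described above: a media-router access token is issued only after the host accepts the pending join request. The function names and in-memory stores are hypothetical, not part of the actual design:

```python
import secrets

# Hypothetical in-memory stores for pending join requests and issued tokens.
_pending: dict[str, list[str]] = {}
_tokens: dict[str, tuple[str, str]] = {}

def request_join(meeting_id: str, guest_name: str) -> None:
    """A guest asks to join; nothing is issued yet."""
    _pending.setdefault(meeting_id, []).append(guest_name)

def host_accept(meeting_id: str, guest_name: str) -> str:
    """Issue a media-router access token only after host approval."""
    if guest_name not in _pending.get(meeting_id, []):
        raise PermissionError("no pending join request for this guest")
    _pending[meeting_id].remove(guest_name)
    token = secrets.token_urlsafe(16)
    _tokens[token] = (meeting_id, guest_name)
    return token
```

Until `host_accept` runs, the guest holds no credential, so the media router can reject any connection attempt outright.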
Scalability#
Locally distributed media routers make scaling services easier. We also have decoupled media routers and media controllers, which allow us to deploy multiple media routers in an area controlled by a single controller, making this a cost-effective solution. Stateless communication between the conferencing service and the media controller allows efficient resource management during workload peaks.
Point to Ponder
Question
What determines the maximum number of users a service like Zoom can handle in a single meeting?
It is difficult to know the exact participant count that a service can handle. Determining the upper limit of a service depends on many factors (some of which are dynamically changing):
- Device specifications such as processing power, memory, etc.
- The number of features provided by the services and the extent to which those features interact.
- The number of components involved and their capabilities, such as databases, third-party services (if any), etc.
- Number of active users. For example, a meeting could have a hundred participants with only a few actually sharing video. Compare that to 40 participants, all sharing video and actively collaborating: the latter puts more load on the service than a hundred mostly inactive users.
- Many other factors, such as rate limiting, data type, average response time, available bandwidth, and so on.
All of these factors together determine how well a service operates, and at the design level we can only estimate a safe number of participants that a service should support based on its SLA. Real services run empirical tests to arrive at a reasonable number. For services like Zoom, it's recommended to support at least 500 participants per meeting.
Optimization and tradeoffs#
The stateful nature of WebSockets can be a scalability issue for our service, but it is unavoidable given the two-way, real-time nature of the service. However, we can scale our service by increasing the number of regional media servers. This is an expensive solution, but there is always some sort of tradeoff.
Additionally, because we’ve learned from a previous lesson that an MCU is computation-intensive while an SFU is bandwidth-intensive, we can take a hybrid approach by having servers act as both. The server can intelligently switch to MCU when bandwidth is limited and shift to simulcast SFU when network conditions are better. We can also create client meshes within enterprise networks for meetings with a small number of participants. Moreover, we can preallocate resources for scheduled meetings based on the expected number of attendees.
This gives us the flexibility to either create peer-to-peer connections (for small groups on the same network) or employ media servers where they are more effective in an enterprise network. By adopting this enhanced version, we can have the following benefits:
We can independently manage rooms for small groups that are meeting within the enterprise networks and offload some of the work from media servers. With this approach, client devices will interact directly by creating a peer-to-peer connection, and we can always shift to servers when the number of participants exceeds a certain threshold.
We can improve the overall user experience by moving from SFU to a hybrid (MCU and SFU) approach that allows the service to adapt to network conditions more effectively.
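The hybrid selection logic above might be sketched as follows. The participant threshold and bandwidth cutoff are illustrative assumptions, not values from the design:

```python
def pick_topology(participants: int, same_enterprise_network: bool,
                  bandwidth_kbps: int) -> str:
    """Choose a media topology per the hybrid strategy described above.
    The thresholds (8 participants, 2000 kbps) are illustrative assumptions."""
    if same_enterprise_network and participants <= 8:
        return "p2p-mesh"       # small in-network groups connect directly
    if bandwidth_kbps < 2000:
        return "mcu"            # mix streams server-side on constrained links
    return "simulcast-sfu"      # forward quality layers when bandwidth allows
```

A coordinator could re-evaluate this choice as participants join or network conditions change, shifting a P2P mesh onto media servers once the group outgrows the threshold.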
Low latency#
Routing media through locally distributed servers reduces overall user-perceived latency. These servers act as simulcast SFUs, and adaptively upscale or downscale video resolution based on the network conditions. We also deploy media controllers in different geographic regions to facilitate a smooth user experience in joining and controlling meeting sessions. Furthermore, the communication between the client and the media server is based on WebSockets, which is relatively faster than HTTP-based communication, helping us achieve fast bidirectional data flow with low latency.
Achieving Non-Functional Requirements
| Non-Functional Requirements | Approaches |
| --- | --- |
| Availability and reliability | Role-based division of servers; replication across geographic regions; API monitoring and circuit breakers; rate limiting based on account type |
| Security | TLS 1.3 and AES key exchange; AES-encrypted transfers over WebSockets; OAuth and OpenID Connect with PKCE; access tokens for the media router |
| Scalability | Locally distributed media routers; decoupled media routers and controllers; stateless communication between the conferencing service and the media controller |
| Low latency | Nearby simulcast SFU servers with adaptive resolution; regional media controllers; WebSockets-based client-server communication |
Latency budget#
Latency estimation for our Zoom API involves calculating the latency of the following three events:
- Joining a meeting
- Setting up a session
- Exchanging video clips
As discussed in the back-of-the-envelope calculations for latency, the latency of GET and POST requests is affected by two different parameters. In the case of GET, the average base RTT remains the same regardless of the data size (due to the small request size), while the time to download the response varies per KB of data. Similarly, for POST requests, the time changes with the data size per KB on top of the base RTT (the minimum RTT taken by a request with the smallest data size).
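The two request models can be written as small functions. The base RTT and per-KB costs are left as parameters, since their concrete values come from the back-of-the-envelope lesson and aren't reproduced in this excerpt:

```python
def get_latency_ms(base_rtt_ms: float, download_per_kb_ms: float,
                   response_kb: float) -> float:
    """GET: a fixed base RTT (requests are small) plus a per-KB cost
    to download the response."""
    return base_rtt_ms + download_per_kb_ms * response_kb

def post_latency_ms(base_rtt_ms: float, upload_per_kb_ms: float,
                    request_kb: float) -> float:
    """POST: the base RTT of the smallest request plus a per-KB cost
    that grows with the request body size."""
    return base_rtt_ms + upload_per_kb_ms * request_kb
```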
Let's discuss each of the above points one by one:
Joining a meeting#
Clients initiate simple GET requests to obtain meeting details. Let's calculate the message size and using that message size, also estimate the response time for this request to complete.
Request and response size#
Let's assume that the response size of the request is 4 KB, containing the meeting details such as start_time, agenda, and settings.
Response time#
We can put the response size in the following calculator and estimate the minimum and maximum response time for joining a meeting's waiting room.
Response Time Calculator to Join a Meeting
| Enter the size in KBs | 4 | KB |
| Minimum latency | 192.1 | ms |
| Maximum latency | 273.1 | ms |
| Minimum response time | 196.1 | ms |
| Maximum response time | 277.1 | ms |
Assuming the response size is 4 KB, the latency is calculated by:
Latency = Base RTT + (Response size × Download time per KB)
Similarly, the response time is calculated using the following equation:
Response time = Latency + Processing time
For the minimum response time, we use the minimum values of base time and processing time (the calculator implies a 4 ms processing overhead, since 196.1 − 192.1 = 4):
Minimum response time = 192.1 + 4 = 196.1 ms
For the maximum response time, we use the maximum values of base time and processing time:
Maximum response time = 273.1 + 4 = 277.1 ms
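To make the arithmetic explicit, here is a minimal sketch of the response-time step; the 4 ms processing overhead is inferred from the calculator's values (196.1 − 192.1 = 277.1 − 273.1 = 4), not stated in the lesson:

```python
# Processing overhead inferred from the calculator: response time − latency.
PROCESSING_MS = 4.0

def response_time_ms(latency_ms: float) -> float:
    """Response time = latency + processing/compile overhead."""
    return latency_ms + PROCESSING_MS

print(response_time_ms(192.1))  # minimum response time, ~196.1 ms
print(response_time_ms(273.1))  # maximum response time, ~277.1 ms
```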
Setting up a session#
We use standard HTTP POST requests to store client sessions on the media controller server. Let's also calculate a rough estimate of the response time for storing a user session successfully.
Request and response size#
Let's assume that, on average, the session description has a size of 2 KB, which contains information about the media type, media encoding, required bandwidth, and so on. The estimated size of the returned response is also 2 KB, containing the access token and the configuration ID created on the media controller. Since a 2 KB response size for a POST request is standard (as we know from previous lessons), we only need the request size to calculate the response time.
Response time#
Let's put the request size in the following calculator to get the minimum and maximum response time for storing a session on the media controller:
Response Time Calculator to Share a Session Description
| Enter the size in KBs | 2 | KB |
| Minimum latency | 383.2 | ms |
| Maximum latency | 464.2 | ms |
| Minimum response time | 387.2 | ms |
| Maximum response time | 468.2 | ms |
Assuming the request size is 2 KB, the latency is calculated by:
Latency = Base RTT + (Request size × Upload time per KB)
Similarly, the response time is calculated as follows, again with the 4 ms processing overhead implied by the calculator:
Minimum response time = 383.2 + 4 = 387.2 ms
Maximum response time = 464.2 + 4 = 468.2 ms
Exchanging video clips#
This event consists of two steps:
- The HTTP upgrade request
- The data exchange on WebSockets
As we know from our previous discussions, upgrade requests are only made by exchanging HTTP headers and can take a maximum of
Now, let's move on to the next step and take an example where a person sends a full HD stream of 1080p resolution and receives a similar HD stream, along with ten low-quality streams of 144p thumbnail-sized videos. Let's calculate the latency of a one-second clip sent and received by the client.
Message size#
Assuming the encoding used to send and receive the stream is high-quality H264, and the devices are configured to send the stream at 30 fps, the estimated size of one second of 1080p video for such a setup is 614.4 KB (the chunk size used in the calculator below).
Assuming the size of each 144p thumbnail-sized video is 12 KB in the same configuration, and the one-second 1080p clip size is as above, we can calculate the incoming message size as follows:
Incoming message size = 614.4 + (10 × 12) = 734.4 KB
Response time#
Knowing that data is transferred over an already established WebSockets connection, we only need to calculate the transfer time using the following formula:
Transfer time = Data size (KB) × Transfer time per KB
Note: Here, we can ignore other factors such as base time, request compile time, and so on because WebSockets do not follow the request-response model.
Let's use the following calculator to add the forwarding time taken by our simulcast SFU server and get the overall estimated response time:
User Perceived Latency Calculator for Exchanging Video Clips
| Enter the chunk size in KBs | 614.4 | KB |
| Thumbnail-sized video in KBs | 12 | KB |
| No. of thumbnail-sized videos | 10 | Integer |
| Outgoing stream latency | 280.76 | ms |
| Incoming stream latency | 328.76 | ms |
| User perceived latency | 613.52 | ms |
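The calculator's structure can be sketched as a pair of functions. The per-KB transfer cost and the SFU forwarding time are parameters here, since their exact values come from the back-of-the-envelope lesson and aren't reproduced above; the sizes match the example (614.4 KB outgoing, 614.4 + 10 × 12 = 734.4 KB incoming):

```python
def transfer_time_ms(size_kb: float, per_kb_ms: float) -> float:
    """Pure transfer time over an established WebSocket: size times per-KB cost."""
    return size_kb * per_kb_ms

def perceived_latency_ms(outgoing_kb: float, incoming_kb: float,
                         per_kb_ms: float, forwarding_ms: float) -> float:
    """User-perceived latency: each leg's transfer time plus the simulcast
    SFU's forwarding time on the outgoing and incoming paths."""
    return (transfer_time_ms(outgoing_kb, per_kb_ms) + forwarding_ms
            + transfer_time_ms(incoming_kb, per_kb_ms) + forwarding_ms)

outgoing_kb = 614.4            # one second of 1080p video
incoming_kb = 614.4 + 10 * 12  # HD clip plus ten 12 KB thumbnails
# per_kb_ms and forwarding_ms below are illustrative placeholders, not the
# lesson's actual constants.
print(perceived_latency_ms(outgoing_kb, incoming_kb, per_kb_ms=0.4, forwarding_ms=35))
```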
The overall latency budget for exchanging video clips using our Zoom API is summarized in the illustration below:
Note: The above response times are calculated using formulas derived for HTTP requests. Generally, however, WebSockets are more lightweight than HTTP because their headers are relatively small, which further reduces the response time of each request. Furthermore, these calculations assume distant client-server communication, whereas our design takes server proximity into account. Therefore, while these estimates are on the high side, the actual numbers will be considerably lower when requests are served from nearby locations.
Summary#
In this lesson, we discussed how our API meets non-functional requirements. We also learned how to further improve the performance and scalability of the service. Finally, we estimated the average time for the expected response from the API.